Fundamentals of Data Science - Week 2 and Week 3

Scroll down to the bottom of the notebook to see your assignment

Deadline: **27.09.2017 (Wednesday) at 23:55 CEST**

In this notebook we are going to cover the following practical aspects of data science:

  • Gathering data (scraping the Twitter Streaming API)
  • Storing and organizing it (store to file or a database)
  • Preprocess the data
  • Perform sentiment, topical and correlation analysis
  • Visualize

To complete this assignment you need to have a running Anaconda installation with Python 2.7 on your device. If this is not the case, refer back to Week 1. Python package prerequisites include:

  • Twitter API Client Tweepy                                 [Install command: pip install tweepy]
  • Python Data Analysis Library Pandas             [Install command: pip install pandas]
  • Python Visualization Library MatPlotLib          [Install command: python -m pip install matplotlib]
  • Python-Mongo Database Client PyMongo       [Install command: python -m pip install pymongo]
  • Python Topic Modelling Library GENISM        [Install command: pip install --upgrade gensim]

An additional requirement if you would like to use a database is MongoDB:

MongoDB stuff

If MongoDB is installed on your device and a database named Twitter is created, the tweets can be stored as database entries using the following code:

Note: If Mongo is not installed on your device it will yield a Connection Refused exception.


In [1]:
from pymongo import MongoClient

client = MongoClient()
db = client.Twitter
#db = client.tweets_sample

In [2]:
from pprint import pprint

In [3]:
import pandas as pd
import geopandas as gpd

In [4]:
import numpy as np
import matplotlib.pyplot as plt

In [5]:
import time

Load necessary data from MongoDB


In [6]:
start_time = time.time()
#we are filtering out tweets of different languages and outside of the US
filter_query = { 
    "$and":[ {"place.country_code":"US"}, { "lang": "en" } ]
    }
#we are keeping only our fields of interest
columns_query = {
    'text':1,
    'entities.hashtags':1,
    'entities.user_mentions':1,
    'place.full_name':1,
    'place.bounding_box':1
}

tweets = pd.DataFrame(list(db.tweets.find(
    filter_query, 
    columns_query
    )#.limit()
                               )
                          )
elapsed_time = time.time() - start_time
print elapsed_time


19.6721830368

In [7]:
tweets.drop(['_id'],axis=1,inplace=True)

In [8]:
tweets.head()


Out[8]:
entities place text
0 {u'user_mentions': [{u'id': 813286, u'indices'... {u'bounding_box': {u'type': u'Polygon', u'coor... @BarackObama \n@FBI\n@LORETTALYNCH \nALL IN CO...
1 {u'user_mentions': [], u'hashtags': [{u'indice... {u'bounding_box': {u'type': u'Polygon', u'coor... #CNN #newday clear #Trump deliberately throwin...
2 {u'user_mentions': [{u'id': 4852163069, u'indi... {u'bounding_box': {u'type': u'Polygon', u'coor... @mike4193496 @realDonaldTrump I TOTALLY CONCUR...
3 {u'user_mentions': [{u'id': 1339835893, u'indi... {u'bounding_box': {u'type': u'Polygon', u'coor... @HillaryClinton you ARE the co-founder of ISIS...
4 {u'user_mentions': [{u'id': 25073877, u'indice... {u'bounding_box': {u'type': u'Polygon', u'coor... @realDonaldTrump, you wouldn't recognize a lie...

In [9]:
print len(tweets)


517724

Extract data we need into their own columns (links, mentions, hashtags)

  • Deal with hyperlinks

In [10]:
import re

# A function that extracts the hyperlinks from the tweet's content.
def extract_link(text):
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    match = re.search(regex, text)
    if match:
        return match.group()
    return ''

# A function that checks whether a word is included in the tweet's content
def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False

In [11]:
tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))

In [12]:
#remove links
tweets['text'] = tweets['text'].apply(lambda tweet: re.sub(r"http\S+", "", tweet))
  • Deal with hashtags & mentions

In [13]:
#Functions to extract hashtags and mentions from entities
def extract_hashtags(ent):
    a=[]
    [a.append(hasht['text'].lower()) for hasht in ent['hashtags']]
    #[a.append(hasht['text']) for hasht in ent['hashtags']]
    return a
def extract_mentions(ent):
    users=[]
    [users.append(usr_ment['screen_name'].lower()) for usr_ment in ent['user_mentions']]
    #[users.append(usr_ment['screen_name']) for usr_ment in ent['user_mentions']]
    return users

In [14]:
tweets['hashtags'] = map(extract_hashtags,tweets['entities'])
tweets['mentions'] = map(extract_mentions,tweets['entities'])
tweets.drop(['entities'],axis=1,inplace=True)

In [15]:
tweets['state'] = map(lambda place_dict: place_dict['full_name'][-2:] ,tweets['place'])
tweets['geography'] = map(lambda place_dict: place_dict['bounding_box'] ,tweets['place'])
tweets.drop(['place'],axis=1,inplace=True)

In [16]:
#make all text lowercase
tweets['text'] = tweets.text.apply(lambda x: x.lower())

In [17]:
tweets.columns


Out[17]:
Index([u'text', u'link', u'hashtags', u'mentions', u'state', u'geography'], dtype='object')

In [ ]:

efforts with SentiStrength


In [18]:
#SentiStrength
import subprocess
jar_path = "/home/antonis/sentistrength/SentiStrength.jar"
senti_data_path = "/home/antonis/sentistrength/SentiData/"

In [19]:
#define a SentiStrength function which takes a string or tokenized input and returns sentiment scores
## TODO : caution: may be slow! (using ''.join )
sample_text = 'this is something'
def SentiStrength(sample_text):
    '''Returns a list of [positive, negative] values'''
    if type(sample_text) is str:
        return subprocess.check_output(['java', '-jar', jar_path, 'sentidata', senti_data_path ,'text',sample_text]).split()
    else:
        return subprocess.check_output(['java', '-jar', jar_path, 'sentidata', senti_data_path ,'text','+'.join(sample_text)]).split()

In [ ]:

Use externally trained model

IMPORTANT:

we should use the same vectorizer as in the trained instance of the classifier our training and prediction matrices have to be the same


In [21]:
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import random

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))


stemmer = SnowballStemmer('english')
analyzer = CountVectorizer().build_analyzer()
  • Externally trained NaiveBayes classifier on 67k tweets

In [22]:
#import the model
from sklearn.externals import joblib

import pickle
clf = joblib.load('trained models/NaiveBayes67k_chi2descr9-26_22,52.pkl') #the model itself
selector = joblib.load("trained models/selector_chi267k_chi2descr9-26_22,52.pkl") #chi2 feature selector
vect = joblib.load('trained models/vectdescr9-26_22,52.pkl') #Countvectorizer along with its vocabulary

In [23]:
from nltk import tokenize

In [ ]:


In [24]:
#rtransform our 
start_time = time.time()

results = clf.predict(selector.transform(vect.transform(tweets['text'])))

elapsed_time = time.time() - start_time
print elapsed_time


102.259329081

In [25]:
#try with probability!
start_time = time.time()

results_prob = clf.predict_proba(selector.transform(vect.transform(tweets['text'])))

elapsed_time = time.time() - start_time
print elapsed_time


112.642943859

In [ ]:


In [26]:
results = pd.Series(results)
results_prob = pd.DataFrame(results_prob,columns=['negative','positive'])

In [27]:
tweets['NB'] = pd.Series(results)
tweets['NB_prob+'] = pd.Series(results_prob['positive'])

In [28]:
tweets.head()


Out[28]:
text link hashtags mentions state geography NB NB_prob+
0 @barackobama \n@fbi\n@lorettalynch \nall in co... https://t.co/5GMNZq40V3 [nojustice, trumppence] [barackobama, fbi, lorettalynch, realdonaldtrump] LA {u'type': u'Polygon', u'coordinates': [[[-91.2... 0 0.268989
1 #cnn #newday clear #trump deliberately throwin... [cnn, newday, trump, isis] [] MD {u'type': u'Polygon', u'coordinates': [[[-76.7... 0 0.200862
2 @mike4193496 @realdonaldtrump i totally concur... [] [mike4193496, realdonaldtrump] MD {u'type': u'Polygon', u'coordinates': [[[-76.5... 0 0.059660
3 @hillaryclinton you are the co-founder of isis... [] [hillaryclinton] TX {u'type': u'Polygon', u'coordinates': [[[-97.0... 1 0.965413
4 @realdonaldtrump, you wouldn't recognize a lie... https://t.co/pKSQM8yikm [nevertrump] [realdonaldtrump] CA {u'type': u'Polygon', u'coordinates': [[[-116.... 1 0.936727

In [35]:
# manually inspect some results
for i in range(1,5):
    j = random.randint(0,len(tweets))
    tw = tweets['text'][j]
    print tw, '\n',tweets.ix[j,'NB_prob+'],' positive'
    print ''


@hughhewitt @noltenc @realdonaldtrump hugh, you're right. #basketoofdeplorables he'll hit again, and again. 
0.86193296115  positive

@karoli @davidshuster @hillaryclinton david might have deleted that after properly put him in line 
0.320181959525  positive

@realdonaldtrump and, btw, no need to look at my tax returns 
0.331246620146  positive

@realdonaldtrump do we need driver license at bank? yes. do we need driver license to take exams? yes. do we need driver license to vote? no 
0.416220810491  positive


In [37]:
#check the distribution of our probability estimates of the tweets
tweets['NB_prob+'].hist()
plt.show()


investigate the possibility of introducing a 'neutral' sentiment class

In [39]:
sa_results = map(lambda x: 'positive' if x>0.7 else 'negative' if x<0.3 else 'neutral' , tweets['NB_prob+'])

In [40]:
sa_results = pd.Series(sa_results)

In [41]:
sa_results.value_counts().plot(kind='bar', title='# of tweets flagged in each sentiment class / at the 30-70% threshold')
plt.show()



In [42]:
tweets.head()


Out[42]:
text link hashtags mentions state geography NB NB_prob+
0 @barackobama \n@fbi\n@lorettalynch \nall in co... https://t.co/5GMNZq40V3 [nojustice, trumppence] [barackobama, fbi, lorettalynch, realdonaldtrump] LA {u'type': u'Polygon', u'coordinates': [[[-91.2... 0 0.268989
1 #cnn #newday clear #trump deliberately throwin... [cnn, newday, trump, isis] [] MD {u'type': u'Polygon', u'coordinates': [[[-76.7... 0 0.200862
2 @mike4193496 @realdonaldtrump i totally concur... [] [mike4193496, realdonaldtrump] MD {u'type': u'Polygon', u'coordinates': [[[-76.5... 0 0.059660
3 @hillaryclinton you are the co-founder of isis... [] [hillaryclinton] TX {u'type': u'Polygon', u'coordinates': [[[-97.0... 1 0.965413
4 @realdonaldtrump, you wouldn't recognize a lie... https://t.co/pKSQM8yikm [nevertrump] [realdonaldtrump] CA {u'type': u'Polygon', u'coordinates': [[[-116.... 1 0.936727

In [ ]:

attach tweet to politician


In [43]:
def trump_in_text(tweet):
        '''This function takes the text of a tweet and 
            returns true if there is a mention to  Donald Trump or false 
            on the other hand.'''
        if ('donald' in tweet.lower()) or ('trump'  in tweet.lower()):
            return True
        return False
    
def clinton_in_text(tweet):
        '''This function takes the text of a tweet and 
            returns true if there is a mention to  Hillary Clinton  or false 
            on the other hand.'''

        if ('hillary' in tweet.lower()) or ('clinton'  in tweet.lower()):
            return True
        return False

def categorize(tr,hil):
        '''This function  categorizes each tweet based
           on text '''
        if tr==hil:
            return 'irrelevant'
        elif tr:
            return 'Trump'
        else:
            return'Clinton'

In [45]:
tweets['Trump'] = tweets['text'].apply(lambda tweet: trump_in_text(tweet))
tweets['Clinton'] = tweets['text'].apply(lambda tweet: clinton_in_text(tweet))
tweets['Politician']=map(lambda tr_col, hil_col: categorize(tr_col, hil_col), tweets['Trump'],tweets['Clinton'])

In [46]:
tweets.drop(['Trump','Clinton','geography'],axis=1,inplace=True)

In [47]:
tweets.head()


Out[47]:
text link hashtags mentions state NB NB_prob+ Politician
0 @barackobama \n@fbi\n@lorettalynch \nall in co... https://t.co/5GMNZq40V3 [nojustice, trumppence] [barackobama, fbi, lorettalynch, realdonaldtrump] LA 0 0.268989 Trump
1 #cnn #newday clear #trump deliberately throwin... [cnn, newday, trump, isis] [] MD 0 0.200862 Trump
2 @mike4193496 @realdonaldtrump i totally concur... [] [mike4193496, realdonaldtrump] MD 0 0.059660 Trump
3 @hillaryclinton you are the co-founder of isis... [] [hillaryclinton] TX 1 0.965413 Clinton
4 @realdonaldtrump, you wouldn't recognize a lie... https://t.co/pKSQM8yikm [nevertrump] [realdonaldtrump] CA 1 0.936727 Trump

In [48]:
tweets.Politician.unique()


Out[48]:
array(['Trump', 'Clinton', 'irrelevant'], dtype=object)

In [49]:
#### give a label to each tweet (ex. pro-Trump / anti-Hillary etc)
def label_tweet(pol,sent, upper_threshold=0.5):
    '''Label tweet depending on politician and sentiment.
    Return neutral if politician unknown or Naive Bayes prob close to 0.5
    otherwise return Politician initials and +/-'''
    if (((sent<upper_threshold) and (sent>(1-upper_threshold))) or (pol=='irrelevant')):
        return 'N'
    if (pol=='Trump'):
        label='T'
    if (pol=='Clinton'):
        label='C'
    if (sent>0.5):
        return label+'+'
    if (sent<0.5):
        return label+'-'
    return 'error'

In [ ]:


In [50]:
tweets['label'] = map(lambda name,sent: label_tweet(name,sent) ,tweets['Politician'],tweets['NB_prob+'])

In [51]:
tweets.head()


Out[51]:
text link hashtags mentions state NB NB_prob+ Politician label
0 @barackobama \n@fbi\n@lorettalynch \nall in co... https://t.co/5GMNZq40V3 [nojustice, trumppence] [barackobama, fbi, lorettalynch, realdonaldtrump] LA 0 0.268989 Trump T-
1 #cnn #newday clear #trump deliberately throwin... [cnn, newday, trump, isis] [] MD 0 0.200862 Trump T-
2 @mike4193496 @realdonaldtrump i totally concur... [] [mike4193496, realdonaldtrump] MD 0 0.059660 Trump T-
3 @hillaryclinton you are the co-founder of isis... [] [hillaryclinton] TX 1 0.965413 Clinton C+
4 @realdonaldtrump, you wouldn't recognize a lie... https://t.co/pKSQM8yikm [nevertrump] [realdonaldtrump] CA 1 0.936727 Trump T+

export for each state


In [52]:
from data.US_states import states

In [53]:
#initialize a df indexed by label values
state_sentiment = pd.DataFrame(index=tweets.label.unique())

In [54]:
for state in states.keys():
    state_sentiment[state] = tweets[tweets['state']==state]['label'].value_counts()
state_sentiment = state_sentiment.transpose()

In [55]:
state_sentiment.describe()


Out[55]:
T- C+ T+ C- N
count 51.000000 51.000000 51.000000 51.000000 51.000000
mean 1708.549020 830.196078 2367.372549 667.294118 1922.568627
std 2498.403181 1160.520418 3417.309076 931.886802 2717.523522
min 51.000000 27.000000 74.000000 25.000000 43.000000
25% 397.000000 227.000000 521.000000 154.000000 461.500000
50% 715.000000 399.000000 1264.000000 312.000000 1036.000000
75% 1938.000000 976.000000 2621.000000 842.500000 2158.000000
max 13842.000000 6388.000000 18079.000000 5082.000000 14284.000000

In [56]:
state_sentiment.head()


Out[56]:
T- C+ T+ C- N
WA 2276 1027 3150 833 2300
WI 687 331 1332 286 1281
WV 291 81 430 44 161
FL 7685 3595 10974 3100 9331
WY 51 56 119 49 71

In [155]:
pickle.dump(state_sentiment,open('results/state_sentiment_0.5.pickle','wb'))

In [ ]:

Investigate tweets using hashtags


In [93]:
def get_sentiment_from_hashtag(hashtag_list,
                               anti_trump_list=set(['nevertrump','dumptrump']), anti_hillary_list=set(['lockherup']),
                               pro_trump_list=set(['trumptrain']), pro_hillary_list=set(['imwithher'])
                              ):
    '''Given a list of hashtags, classify the tweet positively (1), neutral (0) or negatively(-1)'''
    hashtag_list = set(hashtag_list)
    pro_trump=anti_trump=pro_hillary=anti_hillary = False
    negative=positive = False
    if (len(hashtag_list.intersection(anti_trump_list))>0) | (len(hashtag_list.intersection(anti_hillary_list))>0):
        negative = True
    
    if (len(hashtag_list.intersection(pro_hillary_list))>0) | (len(hashtag_list.intersection(pro_trump_list))>0):
        positive = True

    if positive == negative:
        return 0 #both negative+positive hashtags
    if positive:
        return 1
    if negative:
        return -1
    return 0 #no 'explanatory' hashtag found

In [97]:
#test our function
print get_sentiment_from_hashtag(['lockherup', 'trumptrain'])
print get_sentiment_from_hashtag(['lockherup', 'dumptrump'])


0
-1

In [102]:
np.unique([tweets.hashtags.apply(lambda x: get_sentiment_from_hashtag(x))])


Out[102]:
array([-1,  0,  1])

In [106]:
tweets['sentiment_hashtag'] = tweets.hashtags.apply(lambda x: get_sentiment_from_hashtag(x))

In [107]:
tweets.head()


Out[107]:
text link hashtags mentions state NB NB_prob+ Politician label sentiment_hashtag
0 @barackobama \n@fbi\n@lorettalynch \nall in co... https://t.co/5GMNZq40V3 [nojustice, trumppence] [barackobama, fbi, lorettalynch, realdonaldtrump] LA 0 0.268989 Trump T- 0
1 #cnn #newday clear #trump deliberately throwin... [cnn, newday, trump, isis] [] MD 0 0.200862 Trump T- 0
2 @mike4193496 @realdonaldtrump i totally concur... [] [mike4193496, realdonaldtrump] MD 0 0.059660 Trump T- 0
3 @hillaryclinton you are the co-founder of isis... [] [hillaryclinton] TX 1 0.965413 Clinton C+ 0
4 @realdonaldtrump, you wouldn't recognize a lie... https://t.co/pKSQM8yikm [nevertrump] [realdonaldtrump] CA 1 0.936727 Trump T+ -1

Benchmark my classifier on negative hashtag tweets


In [116]:
negative_tweets = clf.predict(selector.transform(vect.transform(tweets.text[tweets.sentiment_hashtag==-1])))
negative_idx = tweets.text[tweets.sentiment_hashtag==-1].index

In [117]:
for i in range(1,5):
    j = random.randint(0,len(negative_idx))
    print tweets.ix[j,'NB_prob+'],tweets.text[j],'\n'


0.998751269877 @dds_officer thx 4 the follow. i look 4ward 2 learning n 2 sharing. best, nick. pls visit me at www.  #nevertrump 

0.714839142351 @rogerjstonejr how is the best dressed man in nyc tonight? wishing you well and praying for you tonight. #maga 

0.070043121167 i hope they release everything soon. i think she'll have 2 quit due to illness, but either way we need to go.   

0.367458973798 all of you so called #conservatives will be putting a lib dem to appoint if you don't vote #trump.  it's more than 4 years at stake #gop 

/home/antonis/anaconda2/envs/USelections/lib/python2.7/site-packages/ipykernel/__main__.py:3: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  app.launch_new_instance()

In [118]:
pd.Series(negative_tweets).value_counts()


Out[118]:
1    10429
0     8468
dtype: int64

We can see that there is a problem in our sentiment classifier, as we would expect it to label more tweets negatively

We should investigate ways to improve it, such as train on a political corpus dataset, use bigrams etc. For more on that, please check the discussion section in our assignment.